Abstract

In this paper, we introduce a low-complexity approximation for the discrete Tchebichef transform
(DTT). The proposed forward and inverse transforms are multiplication-free and require a reduced
number of additions and bit-shifting operations. Numerical compression simulations demonstrate the
efficiency of the proposed transform for image and video coding. Furthermore, a Xilinx Virtex-6
FPGA-based hardware realization shows a 44.9% reduction in dynamic power consumption and 64.7% lower
area when compared to the literature.
1 Introduction

The discrete Tchebichef transform (DTT) is a useful tool for signal coding and data decorrelation [1]. In
recent years, the signal processing literature has employed the DTT in several image processing problems,
such as artifact measurement [2], blind integrity verification [3], and image compression [4]–[6]. In
particular, the 8-point DTT has been considered in blind forensics for integrity checking of medical
images [3]. For image compression, the 8-point DTT is also capable of outperforming the 8-point discrete
cosine transform (DCT) in terms of average bit-length in bitstream codification [4]. Moreover, an 8-point
DTT-based encoder capable of improved image quality and reduced encoding/decoding time was proposed
in [5], making it a competitor to state-of-the-art DCT-based methods. However, to the best of our
knowledge, the literature archives only one fast algorithm for the 8-point DTT, which requires a
significant number of arithmetic operations [6]. Such high arithmetic complexity may be a hindrance to the
adoption of the DTT in contemporary devices that demand low-complexity circuitry and low power
consumption [7]–[10].

An alternative to exact transform computation is the use of approximate transforms. Such an
approach has been successfully applied to the exact DCT, resulting in several approximations [11], [12]. In
general, an approximate transform consists of a low-complexity matrix with elements defined over a set
of small integers, such as {0, ±1, ±2, ±3}. The resulting matrix possesses null multiplicative complexity,
because the involved arithmetic operations can be implemented exclusively by means of a reduced number of
additions and bit-shifts. Prominent examples of approximate transforms include the signed DCT [13], the
series of DCT approximations by Bouguezel-Ahmed-Swamy [14]–[16], the approximation by
Lengwehasatit-Ortega [17], and the integer-based approximations described in [11], [12], [18], [19].
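To make the notion of null multiplicative complexity concrete, the following minimal Python sketch evaluates an inner product with coefficients restricted to {0, ±1, ±2, ±3} using only bit-shifts, additions, and sign changes. The function names `scale_small` and `dot_small` are hypothetical helpers introduced for illustration; they are not taken from any of the cited works.

```python
def scale_small(x: int, c: int) -> int:
    """Return c * x for c in {0, ±1, ±2, ±3} without a multiplier.

    Only a left shift (x << 1 == 2x), one addition, and a sign change
    are used, which is the kind of operation counted in this paper.
    """
    mag = abs(c)
    if mag == 0:
        y = 0
    elif mag == 1:
        y = x
    elif mag == 2:
        y = x << 1            # 2x: one bit-shift
    elif mag == 3:
        y = (x << 1) + x      # 3x: one bit-shift and one addition
    else:
        raise ValueError("coefficient outside {0, ±1, ±2, ±3}")
    return -y if c < 0 else y


def dot_small(row, x):
    """Inner product of an integer row with small coefficients and a vector x."""
    acc = 0
    for c, v in zip(row, x):
        acc += scale_small(v, c)
    return acc
```

For instance, `dot_small([1, -3, 3, -2], [5, 1, -2, 4])` accumulates 5 - 3 - 6 - 8 without ever invoking a multiplication.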

In this work, we introduce a low-complexity DTT approximation that requires 54.5% fewer additions than
the exact DTT fast algorithm. The proposed method is suitable for image and video coding, being capable of
processing data coded according to popular standards, such as JPEG [20], H.264 [21], and HEVC [22],
at a low computational cost. Moreover, an FPGA hardware realization of the proposed transform is also
sought.

This paper unfolds as follows. Section 2 describes the DTT and introduces the approximate DTT with
its associated fast algorithm. A computational complexity analysis is also offered. In Section 3, we perform
numerical experiments, applying the proposed transform as a tool for image and video compression.
In Section 4, we provide very large scale integration (VLSI) realizations of the exact DTT and the proposed
approximation. Conclusions and final remarks are in Section 5.


2 Discrete Tchebichef Transform Approximation 

2.1 Exact Discrete Tchebichef Transform 

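The DTT is an orthogonal transformation derived from the discrete Tchebichef polynomials. The
entries of the $N$-point DTT matrix are furnished by [?]:

$$
t_{k,n} = \sqrt{\frac{(2k+1)\,(N-k-1)!}{(N+k)!}} \cdot (1-N)_k \cdot {}_3F_2(-k, -n, 1+k;\, 1, 1-N;\, 1),
\qquad k,n = 0,1,\ldots,N-1, \tag{1}
$$

where ${}_3F_2(a_1,a_2,a_3;\, b_1,b_2;\, z) = \sum_{i=0}^{\infty} \frac{(a_1)_i (a_2)_i (a_3)_i}{(b_1)_i (b_2)_i} \frac{z^i}{i!}$
is the generalized hypergeometric function and $(a)_k = a(a+1)\cdots(a+k-1)$ is the ascending factorial.
Therefore, the analysis and synthesis equations for the DTT are given by
$\mathbf{X} = \mathbf{T}\cdot\mathbf{x}$ and $\mathbf{x} = \mathbf{T}^{-1}\cdot\mathbf{X} = \mathbf{T}^\top\cdot\mathbf{X}$,
where $\mathbf{x} = [x_0\ x_1\ \cdots\ x_{N-1}]^\top$ is the input signal,
$\mathbf{X} = [X_0\ X_1\ \cdots\ X_{N-1}]^\top$ is the transformed signal, and $\mathbf{T}$ is the $N$-point
DTT matrix with elements $t_{k,n}$, $k,n = 0,1,\ldots,N-1$.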

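In particular, the 8-point DTT matrix $\mathbf{T}$ can be described by the product of a diagonal matrix
$\mathbf{F}$ and an integer-entry matrix $\mathbf{T}_0$ [6], resulting in
$\mathbf{T} = \mathbf{F}\cdot\mathbf{T}_0$, where

$$
\mathbf{T}_0 =
\left[\begin{array}{rrrrrrrr}
 1 &   1 &   1 &   1 &   1 &   1 &   1 & 1 \\
-7 &  -5 &  -3 &  -1 &   1 &   3 &   5 & 7 \\
 7 &   1 &  -3 &  -5 &  -5 &  -3 &   1 & 7 \\
-7 &   5 &   7 &   3 &  -3 &  -7 &  -5 & 7 \\
 7 & -13 &  -3 &   9 &   9 &  -3 & -13 & 7 \\
-7 &  23 & -17 & -15 &  15 &  17 & -23 & 7 \\
 1 &  -5 &   9 &  -5 &  -5 &   9 &  -5 & 1 \\
-1 &   7 & -21 &  35 & -35 &  21 &  -7 & 1
\end{array}\right] \tag{2}
$$

and $\mathbf{F} = \frac{1}{2}\cdot\operatorname{diag}\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{42}}, \frac{1}{\sqrt{42}}, \frac{1}{\sqrt{66}}, \frac{1}{\sqrt{154}}, \frac{1}{\sqrt{546}}, \frac{1}{\sqrt{66}}, \frac{1}{\sqrt{858}}\right)$.
A fast algorithm for the above integer matrix $\mathbf{T}_0 = \mathbf{F}^{-1}\cdot\mathbf{T}$ was derived
in [6], requiring 44 additions and 29 bit-shifting operations. Such arithmetic complexity is considered
excessive when compared to state-of-the-art discrete transform approximations, which generally require
fewer than 24 additions [12], [13], [16], [17].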
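The factorization of the exact transform into a diagonal matrix and an integer matrix can be checked numerically. The sketch below is only an illustration: it types in the integer matrix of (2) by hand and, as an assumption of the sketch, obtains the diagonal scaling numerically from the reciprocal row norms of the integer matrix, since that is what makes the product orthonormal. The function names `dtt` and `idtt` are hypothetical.

```python
import numpy as np

# Integer-entry matrix T0 as in (2).
T0 = np.array([
    [ 1,   1,   1,   1,   1,   1,   1, 1],
    [-7,  -5,  -3,  -1,   1,   3,   5, 7],
    [ 7,   1,  -3,  -5,  -5,  -3,   1, 7],
    [-7,   5,   7,   3,  -3,  -7,  -5, 7],
    [ 7, -13,  -3,   9,   9,  -3, -13, 7],
    [-7,  23, -17, -15,  15,  17, -23, 7],
    [ 1,  -5,   9,  -5,  -5,   9,  -5, 1],
    [-1,   7, -21,  35, -35,  21,  -7, 1],
], dtype=float)

# Assumption of this sketch: F = diag(1 / ||row_k||), computed numerically.
F = np.diag(1.0 / np.linalg.norm(T0, axis=1))
T = F @ T0  # exact 8-point DTT matrix


def dtt(x):
    """Analysis equation X = T x."""
    return T @ np.asarray(x, dtype=float)


def idtt(X):
    """Synthesis equation x = T^T X, valid because T is orthogonal."""
    return T.T @ np.asarray(X, dtype=float)


if __name__ == "__main__":
    x = np.arange(8.0)
    print(np.allclose(T @ T.T, np.eye(8)))  # orthogonality check
    print(np.allclose(idtt(dtt(x)), x))     # perfect reconstruction check
```

The squared row norms of the integer matrix are 8, 168, 168, 264, 616, 2184, 264, and 3432, that is, four times 2, 42, 42, 66, 154, 546, 66, and 858.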


2.2 DTT Approximation and Fast Algorithm 

In [12], a class of DCT approximations was introduced based on the following relation:
$\operatorname{round}(a\cdot\mathbf{C})$, where $\operatorname{round}(\cdot)$ is the round function as defined
in the C and Matlab languages [12], $a$ is a real parameter, and $\mathbf{C}$ is the exact DCT matrix. We aim
at proposing a similar approach to obtain an 8-point DTT approximation. The scale-and-round approach is
particularly effective when discrete trigonometric transforms are considered. This is because the entries
of such transformation matrices have smaller dynamic ranges when compared to the DTT. In contrast, the DTT
entries have values with a dynamic range roughly seven times larger than the DCT, for example. Thus, the
approximation error implied by the round function is less evenly distributed in non-trigonometric
transform matrices, such as the DTT. To mitigate this effect, we propose a companding-like operation [24],
consisting of a rescaling matrix $\mathbf{D}_0$ that normalizes the DTT matrix entries. Thus, according to
the formalism detailed in [12], we introduce a parametric family of approximate DTT matrices
$\mathbf{T}(a)$, which are given by:

$$
\mathbf{T}(a) = \operatorname{round}(a\cdot\mathbf{T}\cdot\mathbf{D}_0), \tag{3}
$$

where $\mathbf{D}_0 = \operatorname{diag}(\ldots)$ is the diagonal rescaling matrix.

We aim at identifying a particular optimal parameter $a^*$ such that $\mathbf{T}^* = \mathbf{T}(a^*)$ results
in a matrix satisfying the following constraints: (i) the entries of $\mathbf{T}^*$ must be defined over
$\{-1, 0, 1\}$ and (ii) $\mathbf{T}^*$ must possess low arithmetic complexity. Constraint (i) implies the
search space $(0, 3/2)$. Although the above problem is not analytically tractable, its solution can be found
by exhaustive search [?]. By taking the values of $a$ over the considered interval in steps of $10^{-3}$, the
above conditions are satisfied for $0.931 \le a^* \le 0.957$. All values of $a^*$ in this latter interval
imply the same approximate matrix. Thus, the obtained low-complexity forward DTT approximation is given by:

$$
\mathbf{T}^* =
\left[\begin{array}{rrrrrrrr}
 1 &  1 &  1 &  1 &  1 &  1 &  1 & 1 \\
-1 & -1 &  0 &  0 &  0 &  0 &  1 & 1 \\
 1 &  0 &  0 & -1 & -1 &  0 &  0 & 1 \\
-1 &  1 &  1 &  0 &  0 & -1 & -1 & 1 \\
 0 & -1 &  0 &  1 &  1 &  0 & -1 & 0 \\
 0 &  1 & -1 & -1 &  1 &  1 & -1 & 0 \\
 0 & -1 &  1 &  0 &  0 &  1 & -1 & 0 \\
 0 &  0 & -1 &  1 & -1 &  1 &  0 & 0
\end{array}\right], \tag{4}
$$


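and its inverse is given by $(\mathbf{T}^*)^{-1} = \mathbf{T}_1\cdot\mathbf{D}_1$, where

$$
\mathbf{T}_1 =
\left[\begin{array}{rrrrrrrr}
1 & -3 &  3 & -2 &  1 & -1 & -1 & -1 \\
1 & -2 & -1 &  2 & -1 &  1 & -1 &  1 \\
1 & -1 & -1 &  1 & -1 & -2 &  3 & -2 \\
1 & -1 & -1 &  1 &  1 & -2 & -1 &  3 \\
1 &  1 & -1 & -1 &  1 &  2 & -1 & -3 \\
1 &  1 & -1 & -1 & -1 &  2 &  3 &  2 \\
1 &  2 & -1 & -2 & -1 & -1 & -1 & -1 \\
1 &  3 &  3 &  2 &  1 &  1 & -1 &  1
\end{array}\right] \tag{5}
$$

and $\mathbf{D}_1 = \operatorname{diag}\left(\frac{1}{8}, \frac{1}{10}, \frac{1}{8}, \frac{1}{10}, \frac{1}{4}, \frac{1}{10}, \frac{1}{8}, \frac{1}{10}\right)$.
Considering the total energy error [?] between the exact and approximate matrices, we obtained 3.32 and
4.86 as the error values for the direct and inverse transformations, respectively. Such errors are
considered very small [19].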
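The scale-and-round search described above can be reproduced with a few lines of Python. The sketch below rests on one loud assumption: the rescaling matrix is taken to scale each column of the exact DTT matrix by the reciprocal of its largest absolute entry. That choice is not stated explicitly in the text here; it is adopted because, under it, the grid search returns the forward approximation of (4) for exactly the grid values from 0.931 to 0.957. The names `round_half_away`, `approx`, and `matching_grid` are hypothetical.

```python
import numpy as np

T0 = np.array([
    [ 1,   1,   1,   1,   1,   1,   1, 1],
    [-7,  -5,  -3,  -1,   1,   3,   5, 7],
    [ 7,   1,  -3,  -5,  -5,  -3,   1, 7],
    [-7,   5,   7,   3,  -3,  -7,  -5, 7],
    [ 7, -13,  -3,   9,   9,  -3, -13, 7],
    [-7,  23, -17, -15,  15,  17, -23, 7],
    [ 1,  -5,   9,  -5,  -5,   9,  -5, 1],
    [-1,   7, -21,  35, -35,  21,  -7, 1],
], dtype=float)
T = T0 / np.linalg.norm(T0, axis=1, keepdims=True)  # exact 8-point DTT

# ASSUMPTION (not confirmed by the text): D0 normalizes every column of T
# by the reciprocal of its largest absolute entry.
D0 = np.diag(1.0 / np.abs(T).max(axis=0))

# Forward approximation as printed in (4).
T_STAR = np.array([
    [ 1,  1,  1,  1,  1,  1,  1, 1],
    [-1, -1,  0,  0,  0,  0,  1, 1],
    [ 1,  0,  0, -1, -1,  0,  0, 1],
    [-1,  1,  1,  0,  0, -1, -1, 1],
    [ 0, -1,  0,  1,  1,  0, -1, 0],
    [ 0,  1, -1, -1,  1,  1, -1, 0],
    [ 0, -1,  1,  0,  0,  1, -1, 0],
    [ 0,  0, -1,  1, -1,  1,  0, 0],
])


def round_half_away(x):
    """Round to nearest integer with halves away from zero (C / Matlab round)."""
    return np.sign(x) * np.floor(np.abs(x) + 0.5)


def approx(a):
    """Parametric approximation round(a * T * D0) from (3)."""
    return round_half_away(a * T @ D0).astype(int)


def matching_grid(step=1e-3):
    """Grid values of a in (0, 3/2) whose approximation equals T_STAR."""
    grid = np.arange(1, 1500) * step
    return [float(a) for a in grid if np.array_equal(approx(a), T_STAR)]


if __name__ == "__main__":
    hits = matching_grid()
    print(min(hits), max(hits), len(hits))
```

Constraint (ii), low arithmetic complexity, is not modeled here; the sketch only locates the grid values that yield the printed matrix.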

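Thus, employing the orthogonalization procedure described in [12], we obtain the following expression
for the DTT approximation: $\hat{\mathbf{T}} = \mathbf{D}^*\cdot\mathbf{T}^*$, where
$\mathbf{D}^* = \sqrt{[\operatorname{ediag}(\mathbf{T}^*\cdot(\mathbf{T}^*)^\top)]^{-1}} = \operatorname{diag}(d_0, d_1, \ldots, d_7)$
is a diagonal matrix and $\operatorname{ediag}(\cdot)$ returns a diagonal matrix with the diagonal elements
of its matrix argument [12]. The inverse transformation is
$\hat{\mathbf{T}}^{-1} = (\mathbf{D}^*\cdot\mathbf{T}^*)^{-1} = (\mathbf{T}^*)^{-1}\cdot(\mathbf{D}^*)^{-1} = \mathbf{T}_1\cdot\mathbf{D}_1\cdot(\mathbf{D}^*)^{-1}$.
Therefore, the analysis and synthesis equations for the proposed transform are given by
$\hat{\mathbf{X}} = \hat{\mathbf{T}}\cdot\mathbf{x}$ and
$\mathbf{x} = \mathbf{T}_1\cdot\mathbf{D}_1\cdot(\mathbf{D}^*)^{-1}\cdot\hat{\mathbf{X}}$, where
$\hat{\mathbf{X}} = [\hat{X}_0\ \hat{X}_1\ \cdots\ \hat{X}_7]^\top$ is the approximate transformed vector.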
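The analysis and synthesis equations of the proposed transform can be exercised numerically. In the sketch below, the forward and inverse integer matrices are typed in by hand from (4) and (5). Two points are assumptions of the sketch rather than statements of the text: the diagonal of the inverse scaling matrix is chosen so that the product with the integer inverse matrix inverts the forward matrix exactly (the code checks this), and the orthogonalization matrix is read as the inverse square root of the diagonal of the Gram matrix. The names `forward` and `inverse` are hypothetical.

```python
import numpy as np

T_STAR = np.array([
    [ 1,  1,  1,  1,  1,  1,  1, 1],
    [-1, -1,  0,  0,  0,  0,  1, 1],
    [ 1,  0,  0, -1, -1,  0,  0, 1],
    [-1,  1,  1,  0,  0, -1, -1, 1],
    [ 0, -1,  0,  1,  1,  0, -1, 0],
    [ 0,  1, -1, -1,  1,  1, -1, 0],
    [ 0, -1,  1,  0,  0,  1, -1, 0],
    [ 0,  0, -1,  1, -1,  1,  0, 0],
], dtype=float)

T1 = np.array([
    [1, -3,  3, -2,  1, -1, -1, -1],
    [1, -2, -1,  2, -1,  1, -1,  1],
    [1, -1, -1,  1, -1, -2,  3, -2],
    [1, -1, -1,  1,  1, -2, -1,  3],
    [1,  1, -1, -1,  1,  2, -1, -3],
    [1,  1, -1, -1, -1,  2,  3,  2],
    [1,  2, -1, -2, -1, -1, -1, -1],
    [1,  3,  3,  2,  1,  1, -1,  1],
], dtype=float)

# ASSUMPTION: diagonal chosen so that T1 @ D1 is the exact inverse of T_STAR.
D1 = np.diag([1/8, 1/10, 1/8, 1/10, 1/4, 1/10, 1/8, 1/10])

# ASSUMPTION: D* = sqrt(ediag(T* T*^T)^-1), i.e. reciprocal row norms of T*.
D_STAR = np.diag(1.0 / np.sqrt(np.diag(T_STAR @ T_STAR.T)))
T_HAT = D_STAR @ T_STAR


def forward(x):
    """Analysis: X_hat = T_hat x."""
    return T_HAT @ np.asarray(x, dtype=float)


def inverse(X_hat):
    """Synthesis: x = T1 D1 (D*)^-1 X_hat."""
    return T1 @ D1 @ np.linalg.inv(D_STAR) @ np.asarray(X_hat, dtype=float)


if __name__ == "__main__":
    x = np.array([52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0])
    print(np.allclose(T_STAR @ T1 @ D1, np.eye(8)))  # inverse check
    print(np.allclose(inverse(forward(x)), x))       # round trip
```

Note that the rows of the forward integer matrix are not mutually orthogonal (for example, the third and fifth rows have inner product -2), so the scaled matrix is only approximately orthogonal and the synthesis step uses the explicit inverse rather than the transpose.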

However, in several contexts, diagonal matrices such as $\mathbf{D}_1$ and $\mathbf{D}^*$ represent only
scaling factors and may not contribute to the computational cost of transformations. For instance, in
JPEG-based image compression applications, diagonal matrices can be embedded into the quantization
block [6], [11], [12] and, when the explicit transform coefficients are not needed, a scaled version of the
transform-domain spectrum is sufficient [25]. Therefore, hereafter, we disregard the diagonal matrices and
focus our analysis on the low-complexity matrices $\mathbf{T}^*$ and $\mathbf{T}_1$. A fast algorithm based
on sparse matrix factorization [11], [12], [14] was derived for the proposed forward and inverse
approximations. In Figure 1, the signal flow graph (SFG) for the direct transformation is depicted. The SFG
for the inverse transformation can be obtained according to the methods described in [26]. Moreover,
Table 1 summarizes the arithmetic complexity assessment for the proposed transformations. The fast
algorithms for $\mathbf{T}^*$ and $\mathbf{T}_1$ demand 54.5% and 34.1% fewer additions, respectively, than
the DTT fast algorithm proposed in [6].
Figure 1: Signal flow graph for $\mathbf{T}^*$. Input data $x_n$, $n = 0,1,\ldots,7$, relates to the output
$X_k$, $k = 0,1,\ldots,7$. Dashed arrows represent multiplications by $-1$. Scaling by $d_k$,
$k = 0,1,\ldots,7$, can be ignored and absorbed into the quantization step.



Table 1: Arithmetic complexity of the proposed 1-D transforms

Method           Mult.   Additions   Shifts   Total
Exact DTT [6]    0       44          29       73
Proposed T*      0       20          0        20
Proposed T_1     0       29          8        37
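The addition count reported for the forward approximation can be illustrated with a small butterfly-style routine. The sketch below is not necessarily the signal flow graph of Figure 1; it is one possible even/odd factorization, written for illustration, that evaluates the product of the forward approximation matrix with an input vector using 20 additions/subtractions, no bit-shifts, and no multiplications (sign changes are not counted, as in the figure caption). The function name `approx_dtt_fast` is hypothetical.

```python
import numpy as np

T_STAR = np.array([
    [ 1,  1,  1,  1,  1,  1,  1, 1],
    [-1, -1,  0,  0,  0,  0,  1, 1],
    [ 1,  0,  0, -1, -1,  0,  0, 1],
    [-1,  1,  1,  0,  0, -1, -1, 1],
    [ 0, -1,  0,  1,  1,  0, -1, 0],
    [ 0,  1, -1, -1,  1,  1, -1, 0],
    [ 0, -1,  1,  0,  0,  1, -1, 0],
    [ 0,  0, -1,  1, -1,  1,  0, 0],
])


def approx_dtt_fast(x):
    """Compute T_STAR @ x with 20 additions and no multiplications."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x

    # Stage 1: input butterflies (8 additions).
    e0, e1, e2, e3 = x0 + x7, x1 + x6, x2 + x5, x3 + x4
    o0, o1, o2, o3 = x0 - x7, x1 - x6, x2 - x5, x3 - x4

    # Even-indexed outputs (6 additions).
    X0 = (e0 + e1) + (e2 + e3)
    X2 = e0 - e3
    X4 = e3 - e1
    X6 = e2 - e1

    # Odd-indexed outputs (6 additions; the leading minus is a sign change).
    X1 = -(o0 + o1)
    X3 = (o1 + o2) - o0
    X5 = o1 - (o2 + o3)
    X7 = o3 - o2

    return [X0, X1, X2, X3, X4, X5, X6, X7]


if __name__ == "__main__":
    x = [52, 55, 61, 66, 70, 61, 64, 73]
    print(approx_dtt_fast(x))
    print((T_STAR @ np.array(x)).tolist())
```

The split into symmetric (even-indexed) and antisymmetric (odd-indexed) rows is what allows the first stage to be shared by all eight outputs.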


3 Experimental Results 

3.1 Image Compression 

In order to assess the proposed transform in image compression applications, we performed a JPEG-like
simulation based on [6], [11], [12]. A set of 45 512×512 8-bit grayscale images obtained from a standard
public image bank [27] was considered. Each image was subdivided into 8×8 blocks $\mathbf{A}_{i,j}$,
$i,j = 1,2,\ldots,64$. Each block was submitted to two-dimensional (2-D) versions of the discussed
transformations according to $\mathbf{B}_{i,j} = \mathbf{M}\cdot\mathbf{A}_{i,j}\cdot\mathbf{M}^\top$, where
$\mathbf{B}_{i,j}$ is the transform-domain block and $\mathbf{M} \in \{\mathbf{T}, \mathbf{T}^*\}$. The
resulting 64 spectral coefficients of each block were ordered in the standard zigzag sequence.
Subsequently, the $r$ initial coefficients in each block were retained and the remaining coefficients were
discarded [?]. We adopted $1 \le r \le 45$. Finally, each transform-domain subimage was submitted to the
inverse 2-D transformation and the full image was reconstructed. Image quality measures were employed to
assess the degradation between original and reconstructed images. The considered measures were the
structural similarity index (SSIM) [28] and the spectral residual based similarity (SR-SIM) [29]. These
measures have the distinction of being consistent with subjective ratings [29], [30]. The peak
signal-to-noise ratio (PSNR) was not considered as a figure of merit because of its limited capability of
capturing the human perception of image fidelity and quality [31]. For each value of $r$, we considered
average measures across all considered images. Such a methodology is less prone to variance effects and
fortuitous data. Figure 2 shows the resulting SSIM and SR-SIM measurements. The proposed transform
performed very closely to the exact DTT. For qualitative purposes, Figure 3 shows compressed images
according to the DTT and the proposed approximation for $r = 6$; the images are visually indistinguishable.

(a) DTT, r = 6    (b) T*, r = 6

Figure 3: Compressed 'Lena' image for r = 6 by means of (a) the DTT and (b) the proposed approximation.
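The block-based experiment described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation code behind the reported figures: it uses a synthetic image instead of the image bank, mean squared error instead of SSIM and SR-SIM, and the row-normalized approximation (the integer matrix of (4) scaled by its reciprocal row norms) so that coefficients are on a scale comparable to the exact DTT. The names `zigzag`, `code_block`, `code_image`, and `mse` are hypothetical.

```python
import numpy as np

T0 = np.array([
    [ 1,   1,   1,   1,   1,   1,   1, 1],
    [-7,  -5,  -3,  -1,   1,   3,   5, 7],
    [ 7,   1,  -3,  -5,  -5,  -3,   1, 7],
    [-7,   5,   7,   3,  -3,  -7,  -5, 7],
    [ 7, -13,  -3,   9,   9,  -3, -13, 7],
    [-7,  23, -17, -15,  15,  17, -23, 7],
    [ 1,  -5,   9,  -5,  -5,   9,  -5, 1],
    [-1,   7, -21,  35, -35,  21,  -7, 1],
], dtype=float)
T_STAR = np.array([
    [ 1,  1,  1,  1,  1,  1,  1, 1],
    [-1, -1,  0,  0,  0,  0,  1, 1],
    [ 1,  0,  0, -1, -1,  0,  0, 1],
    [-1,  1,  1,  0,  0, -1, -1, 1],
    [ 0, -1,  0,  1,  1,  0, -1, 0],
    [ 0,  1, -1, -1,  1,  1, -1, 0],
    [ 0, -1,  1,  0,  0,  1, -1, 0],
    [ 0,  0, -1,  1, -1,  1,  0, 0],
], dtype=float)

T = T0 / np.linalg.norm(T0, axis=1, keepdims=True)              # exact DTT
T_HAT = T_STAR / np.linalg.norm(T_STAR, axis=1, keepdims=True)  # scaled approximation


def zigzag(n=8):
    """(row, col) pairs of an n x n block in the standard zigzag order."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    # Odd anti-diagonals run top-right to bottom-left, even ones the other way.
    return sorted(cells, key=lambda p: (p[0] + p[1],
                                        p[0] if (p[0] + p[1]) % 2 else p[1]))


ZZ = zigzag()


def code_block(A, M, M_inv, r):
    """Transform a block, keep the first r zigzag coefficients, and invert."""
    B = M @ A @ M.T
    kept = np.zeros_like(B)
    for i, j in ZZ[:r]:
        kept[i, j] = B[i, j]
    return M_inv @ kept @ M_inv.T


def code_image(img, M, r):
    """Apply code_block to every 8x8 block of img (sides multiples of 8)."""
    M_inv = np.linalg.inv(M)
    out = np.empty(img.shape, dtype=float)
    for i in range(0, img.shape[0], 8):
        for j in range(0, img.shape[1], 8):
            out[i:i + 8, j:j + 8] = code_block(img[i:i + 8, j:j + 8], M, M_inv, r)
    return out


def mse(a, b):
    return float(np.mean((a - b) ** 2))


if __name__ == "__main__":
    u = np.linspace(0.0, 1.0, 64)
    img = 255.0 * np.outer(np.sin(np.pi * u), np.cos(np.pi * u / 2))  # synthetic image
    for r in (3, 6, 15, 45):
        print(r, mse(img, code_image(img, T, r)), mse(img, code_image(img, T_HAT, r)))
```

Replacing `mse` with a perceptual measure and the synthetic image with real test images would bring the sketch closer to the methodology described above.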

3.2 Video Compression 

With the objective of assessing the proposed transform performance in video coding, we embedded
the proposed DTT approximation in the widely employed x264 software library [32] for encoding video
streams into the H.264/AVC standard [21]. The 8-point transform employed in H.264/AVC is an integer
approximation of the DCT that demands 32 additions and 14 bit-shifting operations [33]. In comparison,
the proposed 8-point direct transform requires 38% fewer additions and no bit-shifting operations, while
the proposed inverse transform requires 9% fewer additions and 43% fewer bit-shifting operations. We
encoded eleven CIF videos with 300 frames at 25 frames per second from a public video database [34] with
the standard and the modified libraries. In our simulation, we employed default settings and controlled the
video quality by two different approaches: (i) target bitrate, varying from 100 to 500 kbps in steps of
50 kbps, and (ii) quantization parameter (QP), varying from 5 to 50 in steps of 5 units. For video quality
assessment, we computed the average SSIM over the luma (Y) component of the video frames. Results are
shown in Figure 4. Even in scenarios of high compression (low bitrate/high QP), the degradation related to
the proposed approximation is on the order of 0.01 units of SSIM and is therefore very low. Figure 5
displays the first encoded frame of a standard video sequence at a low target bitrate (200 kbps). The
resulting compressed frames are visually indistinguishable.
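The operation-count comparison quoted in this subsection can be cross-checked against Table 1 (20 additions and no shifts for the forward approximation; 29 additions and 8 shifts for the inverse) and the 32 additions and 14 shifts of the H.264/AVC 8-point transform. The short calculation below is only a sanity check of the rounded percentages:

$$
\frac{32 - 20}{32} = 37.5\% \approx 38\%, \qquad
\frac{32 - 29}{32} \approx 9.4\% \approx 9\%, \qquad
\frac{14 - 8}{14} \approx 42.9\% \approx 43\%.
$$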
Figure 4: Video quality assessment in terms of (a) fixed target bitrate and (b) quantization parameter.




(a) H.264/AVC (b) Modified H.264/AVC 


Figure 5: First frame of the compressed sequence ‘Foreman’ according to (a) the original H.264/AVC and 
(b) modified H.264/AVC with the proposed approximation. 

4 VLSI Architectures 

To compare the hardware resource consumption of the proposed approximate DTT against the exact DTT
proposed in [6], the 1-D versions of both algorithms were initially modeled and tested in Matlab Simulink
and then physically realized on a Xilinx Virtex-6 XC6VLX240T-1FFG1156 field programmable gate array
(FPGA) device and validated using hardware-in-the-loop testing through the JTAG interface. Both designs
were verified using more than 10000 test vectors, with complete agreement with theoretical values.
Results are shown in Table 2. Metrics including configurable logic block (CLB) and flip-flop (FF) counts,
critical path delay (CPD, in ns), and maximum operating frequency ($F_{\max}$, in MHz) are provided. In
addition, static ($Q_p$, in mW) and frequency-normalized dynamic power ($D_p$, in mW/MHz) consumptions were
estimated using the Xilinx XPower Analyzer. The final throughput of the 1-D DTT approximation was
$438.68 \times 10^6$ 8-point transformations/second, with a pixel rate of $3.509 \times 10^9$ pixels/second.
The percentage reduction in the number of CLBs and FFs was 64.7% and 71%, respectively. The dynamic power
consumption $D_p$ of the proposed architecture was 44.9% lower. The figures of merit area-time ($AT$) and
area-time² ($AT^2$) had percentage reductions of 66.1% and 67.5% when compared with the exact DTT [6].
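The derived figures of merit can be recomputed from the raw entries of Table 2. The sketch below is a plain arithmetic check, with the table values typed in by hand and hypothetical helper names (`area_time`, `area_time2`, `reduction_pct`, `pixel_rate`); it assumes that area is measured by the CLB count and time by the critical path delay, as the table labels (A) and (T) suggest, and that one 8-point transform is completed per clock cycle.

```python
# Raw values typed in from Table 2 (CLB count, FF count, CPD in ns,
# Fmax in MHz, dynamic power in mW/MHz).
DESIGNS = {
    "exact":    {"clb": 408, "ff": 1370, "cpd_ns": 2.390, "fmax_mhz": 418.41, "dp": 5.10},
    "proposed": {"clb": 144, "ff": 396,  "cpd_ns": 2.290, "fmax_mhz": 438.68, "dp": 2.81},
}


def area_time(d):
    """AT figure of merit: CLB count times critical path delay."""
    return d["clb"] * d["cpd_ns"]


def area_time2(d):
    """AT^2 figure of merit: CLB count times squared critical path delay."""
    return d["clb"] * d["cpd_ns"] ** 2


def reduction_pct(ref, new):
    """Percentage reduction of `new` relative to `ref`."""
    return 100.0 * (ref - new) / ref


def pixel_rate(d, points=8):
    """Pixels per second, assuming one `points`-point transform per clock cycle."""
    return d["fmax_mhz"] * 1e6 * points


if __name__ == "__main__":
    ex, pr = DESIGNS["exact"], DESIGNS["proposed"]
    print("CLB reduction (%):", reduction_pct(ex["clb"], pr["clb"]))
    print("FF reduction (%):", reduction_pct(ex["ff"], pr["ff"]))
    print("Dp reduction (%):", reduction_pct(ex["dp"], pr["dp"]))
    print("AT reduction (%):", reduction_pct(area_time(ex), area_time(pr)))
    print("AT^2 reduction (%):", reduction_pct(area_time2(ex), area_time2(pr)))
    print("pixel rate (pixels/s):", pixel_rate(pr))
```

Recomputed this way, the CLB, FF, and dynamic power reductions match the quoted percentages, and the AT and AT² reductions agree with the quoted values to within about 0.1 percentage point, which is consistent with rounding of the tabulated entries.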

Table 2: Resource consumption on Xilinx XC6VLX240T-1FFG1156 device

Resource          Exact DTT [6]   Proposed
CLB (A)           408             144
FF                1370            396
CPD (T) (ns)      2.390           2.290
F_max (MHz)       418.41          438.68
AT                975.1           329.7
AT^2              2330.5          755.1
D_p (mW/MHz)      5.10            2.81
Q_p (mW)          3.44            3.44

5 Conclusion

In this paper, a low-complexity approximation for the 8-point DTT was proposed. The arithmetic cost of
the proposed approximation is significantly lower than that of the exact DTT. At the same time, the
proposed tool is very close to the DTT in terms of image coding for a wide range of compression rates.
In video compression, the introduced approximation was adapted into the popular codec H.264, furnishing
virtually identical results at a much lower computational cost. Our goal with the codec experimentation is
not to suggest the modification of an existing standard. Our objective is to demonstrate the capabilities
of the proposed low-complexity transform in asymmetric codecs [?]. Such codecs are employed when a video
is encoded once but decoded several times in low-power devices [35], [36]. Additionally, the proposed
transform can be considered in distributed video coding (DVC) [36], [37], where the computational
complexity is concentrated in the decoder. A relevant context for DVC is in remote sensors and video
systems that are constrained in terms of power, bandwidth, and computational capabilities [36]. The
proposed approximation is a viable alternative to the DTT, possessing low complexity and good performance
according to meaningful image quality measures. Moreover, the associated hardware realization consumed
roughly 1/3 of the area required by the exact DTT, and the dynamic power consumption was decreased by
44.9%. Future work in this field may consider the evaluation of DTT approximations in quantization
schemes [?].